{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "014d6a9c-fe02-47d0-994e-b1e725f0c54b",
   "metadata": {},
   "source": [
    "## vLLM推理框架\n",
    "\n",
    "vLLM推理框架是一种高效的推理工具，旨在加速和优化大规模语言模型的推理过程。它通过以下几个关键技术实现了这一目标：\n",
    "\n",
    "1. **PagedAttention**: PagedAttention通过将KVCache存储在块状物理显存中，并使用逻辑显存对物理显存进行复用的技术，该技术可以增加显存寻址的连续性，并降低重复显存的数量。\n",
    "2. **异步执行**: vLLM框架支持异步执行，这意味着可以在等待某些计算结果的同时，继续进行其他计算任务，从而提高整体推理效率。\n",
    "3. **Continuous Batching**: 在多batch推理中，一般伴随着短sequence生成完成后等待长sequence完成的padding问题，这些padding不仅占用了额外内存，且占用了生成时间，而通过将新的sequence填充到短sequence后面会让生成时间大大缩短。\n",
    "\n",
    "vLLM框架支持了大部分的纯文本LLM，部分多模态LLM，以及部分GPTQ和AWQ量化模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "4d80b8f0-2a31-4315-8bbb-2b368800289c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-23T13:25:44.378340Z",
     "iopub.status.busy": "2024-12-23T13:25:44.377842Z",
     "iopub.status.idle": "2024-12-23T13:25:48.659961Z",
     "shell.execute_reply": "2024-12-23T13:25:48.659028Z",
     "shell.execute_reply.started": "2024-12-23T13:25:44.378314Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Looking in indexes: https://mirrors.aliyun.com/pypi/simple\n",
      "Requirement already satisfied: vllm in /usr/local/lib/python3.10/site-packages (0.6.3.post1)\n",
      "Requirement already satisfied: psutil in /usr/local/lib/python3.10/site-packages (from vllm) (6.1.0)\n",
      "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/site-packages (from vllm) (0.2.0)\n",
      "Requirement already satisfied: numpy<2.0.0 in /usr/local/lib/python3.10/site-packages (from vllm) (1.26.4)\n",
      "Requirement already satisfied: requests>=2.26.0 in /usr/local/lib/python3.10/site-packages (from vllm) (2.32.3)\n",
      "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/site-packages (from vllm) (4.67.1)\n",
      "Requirement already satisfied: py-cpuinfo in /usr/local/lib/python3.10/site-packages (from vllm) (9.0.0)\n",
      "Requirement already satisfied: transformers>=4.45.2 in /usr/local/lib/python3.10/site-packages (from vllm) (4.46.3)\n",
      "Requirement already satisfied: tokenizers>=0.19.1 in /usr/local/lib/python3.10/site-packages (from vllm) (0.20.3)\n",
      "Requirement already satisfied: protobuf in /usr/local/lib/python3.10/site-packages (from vllm) (5.29.1)\n",
      "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/site-packages (from vllm) (3.11.10)\n",
      "Requirement already satisfied: openai>=1.40.0 in /usr/local/lib/python3.10/site-packages (from vllm) (1.57.0)\n",
      "Requirement already satisfied: uvicorn[standard] in /usr/local/lib/python3.10/site-packages (from vllm) (0.32.1)\n",
      "Requirement already satisfied: pydantic>=2.9 in /usr/local/lib/python3.10/site-packages (from vllm) (2.10.3)\n",
      "Requirement already satisfied: pillow in /usr/local/lib/python3.10/site-packages (from vllm) (10.4.0)\n",
      "Requirement already satisfied: prometheus-client>=0.18.0 in /usr/local/lib/python3.10/site-packages (from vllm) (0.21.1)\n",
      "Requirement already satisfied: prometheus-fastapi-instrumentator>=7.0.0 in /usr/local/lib/python3.10/site-packages (from vllm) (7.0.0)\n",
      "Requirement already satisfied: tiktoken>=0.6.0 in /usr/local/lib/python3.10/site-packages (from vllm) (0.7.0)\n",
      "Requirement already satisfied: lm-format-enforcer==0.10.6 in /usr/local/lib/python3.10/site-packages (from vllm) (0.10.6)\n",
      "Requirement already satisfied: outlines<0.1,>=0.0.43 in /usr/local/lib/python3.10/site-packages (from vllm) (0.0.46)\n",
      "Requirement already satisfied: typing-extensions>=4.10 in /usr/local/lib/python3.10/site-packages (from vllm) (4.12.2)\n",
      "Requirement already satisfied: filelock>=3.10.4 in /usr/local/lib/python3.10/site-packages (from vllm) (3.16.1)\n",
      "Requirement already satisfied: partial-json-parser in /usr/local/lib/python3.10/site-packages (from vllm) (0.2.1.1.post4)\n",
      "Requirement already satisfied: pyzmq in /usr/local/lib/python3.10/site-packages (from vllm) (26.2.0)\n",
      "Requirement already satisfied: msgspec in /usr/local/lib/python3.10/site-packages (from vllm) (0.18.6)\n",
      "Requirement already satisfied: gguf==0.10.0 in /usr/local/lib/python3.10/site-packages (from vllm) (0.10.0)\n",
      "Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.10/site-packages (from vllm) (8.5.0)\n",
      "Requirement already satisfied: mistral-common>=1.4.4 in /usr/local/lib/python3.10/site-packages (from mistral-common[opencv]>=1.4.4->vllm) (1.5.1)\n",
      "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/site-packages (from vllm) (6.0.2)\n",
      "Requirement already satisfied: einops in /usr/local/lib/python3.10/site-packages (from vllm) (0.8.0)\n",
      "Requirement already satisfied: compressed-tensors==0.6.0 in /usr/local/lib/python3.10/site-packages (from vllm) (0.6.0)\n",
      "Requirement already satisfied: ray>=2.9 in /usr/local/lib/python3.10/site-packages (from vllm) (2.40.0)\n",
      "Requirement already satisfied: nvidia-ml-py in /usr/local/lib/python3.10/site-packages (from vllm) (12.560.30)\n",
      "Requirement already satisfied: torch==2.4.0 in /usr/local/lib/python3.10/site-packages (from vllm) (2.4.0)\n",
      "Requirement already satisfied: torchvision==0.19 in /usr/local/lib/python3.10/site-packages (from vllm) (0.19.0)\n",
      "Requirement already satisfied: xformers==0.0.27.post2 in /usr/local/lib/python3.10/site-packages (from vllm) (0.0.27.post2)\n",
      "Requirement already satisfied: fastapi!=0.113.*,!=0.114.0,>=0.107.0 in /usr/local/lib/python3.10/site-packages (from vllm) (0.115.6)\n",
      "Requirement already satisfied: interegular>=0.3.2 in /usr/local/lib/python3.10/site-packages (from lm-format-enforcer==0.10.6->vllm) (0.3.3)\n",
      "Requirement already satisfied: packaging in /usr/local/lib/python3.10/site-packages (from lm-format-enforcer==0.10.6->vllm) (24.2)\n",
      "Requirement already satisfied: sympy in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (1.13.3)\n",
      "Requirement already satisfied: networkx in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (3.4.2)\n",
      "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (3.1.4)\n",
      "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (2024.6.1)\n",
      "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (12.1.105)\n",
      "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (12.1.105)\n",
      "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (12.1.105)\n",
      "Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (9.1.0.70)\n",
      "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (12.1.3.1)\n",
      "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (11.0.2.54)\n",
      "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (10.3.2.106)\n",
      "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (11.4.5.107)\n",
      "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (12.1.0.106)\n",
      "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (2.20.5)\n",
      "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (12.1.105)\n",
      "Requirement already satisfied: triton==3.0.0 in /usr/local/lib/python3.10/site-packages (from torch==2.4.0->vllm) (3.0.0)\n",
      "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch==2.4.0->vllm) (12.6.85)\n",
      "Requirement already satisfied: starlette<0.42.0,>=0.40.0 in /usr/local/lib/python3.10/site-packages (from fastapi!=0.113.*,!=0.114.0,>=0.107.0->vllm) (0.41.3)\n",
      "Requirement already satisfied: jsonschema<5.0.0,>=4.21.1 in /usr/local/lib/python3.10/site-packages (from mistral-common>=1.4.4->mistral-common[opencv]>=1.4.4->vllm) (4.23.0)\n",
      "Requirement already satisfied: opencv-python-headless<5.0.0,>=4.0.0 in /usr/local/lib/python3.10/site-packages (from mistral-common[opencv]>=1.4.4->vllm) (4.10.0.84)\n",
      "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/site-packages (from openai>=1.40.0->vllm) (4.7.0)\n",
      "Requirement already satisfied: distro<2,>=1.7.0 in /usr/local/lib/python3.10/site-packages (from openai>=1.40.0->vllm) (1.9.0)\n",
      "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.10/site-packages (from openai>=1.40.0->vllm) (0.27.2)\n",
      "Requirement already satisfied: jiter<1,>=0.4.0 in /usr/local/lib/python3.10/site-packages (from openai>=1.40.0->vllm) (0.8.0)\n",
      "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/site-packages (from openai>=1.40.0->vllm) (1.3.1)\n",
      "Requirement already satisfied: lark in /usr/local/lib/python3.10/site-packages (from outlines<0.1,>=0.0.43->vllm) (1.2.2)\n",
      "Requirement already satisfied: nest-asyncio in /usr/local/lib/python3.10/site-packages (from outlines<0.1,>=0.0.43->vllm) (1.6.0)\n",
      "Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/site-packages (from outlines<0.1,>=0.0.43->vllm) (3.1.0)\n",
      "Requirement already satisfied: diskcache in /usr/local/lib/python3.10/site-packages (from outlines<0.1,>=0.0.43->vllm) (5.6.3)\n",
      "Requirement already satisfied: numba in /usr/local/lib/python3.10/site-packages (from outlines<0.1,>=0.0.43->vllm) (0.60.0)\n",
      "Requirement already satisfied: referencing in /usr/local/lib/python3.10/site-packages (from outlines<0.1,>=0.0.43->vllm) (0.35.1)\n",
      "Requirement already satisfied: datasets in /usr/local/lib/python3.10/site-packages (from outlines<0.1,>=0.0.43->vllm) (3.0.1)\n",
      "Requirement already satisfied: pycountry in /usr/local/lib/python3.10/site-packages (from outlines<0.1,>=0.0.43->vllm) (24.6.1)\n",
      "Requirement already satisfied: pyairports in /usr/local/lib/python3.10/site-packages (from outlines<0.1,>=0.0.43->vllm) (2.1.1)\n",
      "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/site-packages (from pydantic>=2.9->vllm) (0.7.0)\n",
      "Requirement already satisfied: pydantic-core==2.27.1 in /usr/local/lib/python3.10/site-packages (from pydantic>=2.9->vllm) (2.27.1)\n",
      "Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.10/site-packages (from ray>=2.9->vllm) (8.1.7)\n",
      "Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/site-packages (from ray>=2.9->vllm) (1.1.0)\n",
      "Requirement already satisfied: aiosignal in /usr/local/lib/python3.10/site-packages (from ray>=2.9->vllm) (1.3.1)\n",
      "Requirement already satisfied: frozenlist in /usr/local/lib/python3.10/site-packages (from ray>=2.9->vllm) (1.5.0)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/site-packages (from requests>=2.26.0->vllm) (3.4.0)\n",
      "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/site-packages (from requests>=2.26.0->vllm) (3.10)\n",
      "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/site-packages (from requests>=2.26.0->vllm) (2.2.3)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/site-packages (from requests>=2.26.0->vllm) (2024.8.30)\n",
      "Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/site-packages (from tiktoken>=0.6.0->vllm) (2024.11.6)\n",
      "Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/site-packages (from tokenizers>=0.19.1->vllm) (0.25.2)\n",
      "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/site-packages (from transformers>=4.45.2->vllm) (0.4.5)\n",
      "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->vllm) (2.4.4)\n",
      "Requirement already satisfied: async-timeout<6.0,>=4.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->vllm) (4.0.3)\n",
      "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->vllm) (24.2.0)\n",
      "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/site-packages (from aiohttp->vllm) (6.1.0)\n",
      "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->vllm) (0.2.1)\n",
      "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.10/site-packages (from aiohttp->vllm) (1.18.3)\n",
      "Requirement already satisfied: zipp>=3.20 in /usr/local/lib/python3.10/site-packages (from importlib-metadata->vllm) (3.21.0)\n",
      "Requirement already satisfied: h11>=0.8 in /usr/local/lib/python3.10/site-packages (from uvicorn[standard]->vllm) (0.14.0)\n",
      "Requirement already satisfied: httptools>=0.6.3 in /usr/local/lib/python3.10/site-packages (from uvicorn[standard]->vllm) (0.6.4)\n",
      "Requirement already satisfied: python-dotenv>=0.13 in /usr/local/lib/python3.10/site-packages (from uvicorn[standard]->vllm) (1.0.1)\n",
      "Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /usr/local/lib/python3.10/site-packages (from uvicorn[standard]->vllm) (0.21.0)\n",
      "Requirement already satisfied: watchfiles>=0.13 in /usr/local/lib/python3.10/site-packages (from uvicorn[standard]->vllm) (1.0.0)\n",
      "Requirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.10/site-packages (from uvicorn[standard]->vllm) (11.0.3)\n",
      "Requirement already satisfied: exceptiongroup>=1.0.2 in /usr/local/lib/python3.10/site-packages (from anyio<5,>=3.5.0->openai>=1.40.0->vllm) (1.2.2)\n",
      "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai>=1.40.0->vllm) (1.0.7)\n",
      "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/site-packages (from jsonschema<5.0.0,>=4.21.1->mistral-common>=1.4.4->mistral-common[opencv]>=1.4.4->vllm) (2024.10.1)\n",
      "Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/site-packages (from jsonschema<5.0.0,>=4.21.1->mistral-common>=1.4.4->mistral-common[opencv]>=1.4.4->vllm) (0.22.3)\n",
      "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.10/site-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (18.1.0)\n",
      "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/site-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (0.3.8)\n",
      "Requirement already satisfied: pandas in /usr/local/lib/python3.10/site-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (2.2.3)\n",
      "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/site-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (3.5.0)\n",
      "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/site-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (0.70.16)\n",
      "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/site-packages (from jinja2->torch==2.4.0->vllm) (2.1.5)\n",
      "Requirement already satisfied: llvmlite<0.44,>=0.43.0dev0 in /usr/local/lib/python3.10/site-packages (from numba->outlines<0.1,>=0.0.43->vllm) (0.43.0)\n",
      "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/site-packages (from sympy->torch==2.4.0->vllm) (1.3.0)\n",
      "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/site-packages (from pandas->datasets->outlines<0.1,>=0.0.43->vllm) (2.9.0.post0)\n",
      "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/site-packages (from pandas->datasets->outlines<0.1,>=0.0.43->vllm) (2024.2)\n",
      "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/site-packages (from pandas->datasets->outlines<0.1,>=0.0.43->vllm) (2024.2)\n",
      "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas->datasets->outlines<0.1,>=0.0.43->vllm) (1.17.0)\n",
      "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
      "\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.3.1\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "# 安装vLLM只需要执行下面的命令\n",
    "!pip install vllm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0cd854f3-d877-4228-8ae3-ebbed6929f67",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-12-23T13:25:55.961095Z",
     "iopub.status.busy": "2024-12-23T13:25:55.960554Z",
     "iopub.status.idle": "2024-12-23T13:25:56.090921Z",
     "shell.execute_reply": "2024-12-23T13:25:56.090088Z",
     "shell.execute_reply.started": "2024-12-23T13:25:55.961071Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# 使用下面的环境变量使用ModelScope社区来进行下载提速\n",
    "!export VLLM_USE_MODELSCOPE=1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "354fd96b-9c82-4607-a3a9-88ca63c8e558",
   "metadata": {},
   "outputs": [],
   "source": [
    "需要注意的是，vLLM会预占用大量显存来存储KVCache，显存占用越大速度提升越高。如果需要控制显存占用的量，请使用下面的参数：\n",
    "\n",
    "- gpu_memory_utilization: 从0-1的float小数，默认为0.9，代表了额外显存的占用量\n",
    "\n",
    "除此之外，还有下面的参数经常被用到：\n",
    "\n",
    "- tensor_parallel_size tensor并行数量，如果你有多个显卡可以用这个参数来拆分模型\n",
    "- pipeline_parallel_size pipeline并行数量\n",
    "- max_num_seqs 并行处理的最大sequence数量\n",
    "\n",
    "更多分布式推理的参数请查看vLLM的官方文档：https://docs.vllm.ai/en/latest/serving/distributed_serving.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e55cfa34-88da-45b1-bb6a-31a83c05437e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-23T13:40:58.371721Z",
     "iopub.status.busy": "2024-12-23T13:40:58.371110Z",
     "iopub.status.idle": "2024-12-23T13:41:41.311904Z",
     "shell.execute_reply": "2024-12-23T13:41:41.311344Z",
     "shell.execute_reply.started": "2024-12-23T13:40:58.371674Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/usr/local/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "2024-12-23 21:41:01,356\tINFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading Model to directory: /mnt/workspace/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-12-23 21:41:02,605 - modelscope - INFO - Target directory already exists, skipping creation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading Model to directory: /mnt/workspace/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-12-23 21:41:03,056 - modelscope - INFO - Target directory already exists, skipping creation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 12-23 21:41:07 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)\n",
      "Downloading Model to directory: /mnt/workspace/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-12-23 21:41:07,591 - modelscope - INFO - Target directory already exists, skipping creation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading Model to directory: /mnt/workspace/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-12-23 21:41:18,664 - modelscope - INFO - Target directory already exists, skipping creation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Downloading Model to directory: /mnt/workspace/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-12-23 21:41:19,175 - modelscope - INFO - Target directory already exists, skipping creation.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 12-23 21:41:19 model_runner.py:1056] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...\n",
      "Downloading Model to directory: /mnt/workspace/.cache/modelscope/hub/Qwen/Qwen2.5-1.5B-Instruct\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-12-23 21:41:20,154 - modelscope - INFO - Target directory already exists, skipping creation.\n",
      "Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]\n",
      "Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.11s/it]\n",
      "Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.11s/it]\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 12-23 21:41:21 model_runner.py:1067] Loading model weights took 2.8875 GB\n",
      "INFO 12-23 21:41:22 gpu_executor.py:122] # GPU blocks: 154829, # CPU blocks: 9362\n",
      "INFO 12-23 21:41:22 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 75.60x\n",
      "INFO 12-23 21:41:24 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n",
      "INFO 12-23 21:41:24 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n",
      "INFO 12-23 21:41:38 model_runner.py:1523] Graph capturing finished in 14 secs.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 4/4 [00:00<00:00, 29.34it/s, est. speed input: 161.46 toks/s, output: 469.66 toks/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prompt: 'Hello, my name is', Generated text: ' Kofi. I am a very passionate and driven individual who has a strong desire'\n",
      "Prompt: 'The president of the United States is', Generated text: ' the head of state and the head of government, and he or she is responsible'\n",
      "Prompt: 'The capital of France is', Generated text: ' a city called Paris, and it is the capital of the Republic of France.'\n",
      "Prompt: 'The future of AI is', Generated text: ' bright and growing in power, but more importantly, it’s growing in ethical considerations'\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "# 这个例子来自于vLLM官方\n",
    "from vllm import LLM, SamplingParams\n",
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The president of the United States is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n",
    "llm = LLM(model=\"Qwen/Qwen2.5-1.5B-Instruct\")\n",
    "outputs = llm.generate(prompts, sampling_params)\n",
    "\n",
    "for output in outputs:\n",
    "    prompt = output.prompt\n",
    "    generated_text = output.outputs[0].text\n",
    "    print(f\"Prompt: {prompt!r}, Generated text: {generated_text!r}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "617b1144-7adb-43e9-bf48-8e142007935a",
   "metadata": {},
   "source": [
    "vLLM也支持直接部署，以便用户使用OpenAI格式进行访问："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9137c1e2-e62d-4d2b-890b-e5f4aed3af68",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# 在terminal中运行，否则导致下面的client代码等待\n",
    "!vllm serve Qwen/Qwen2.5-1.5B-Instruct"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ca32b6c-6641-4991-81c8-e965375bcdf0",
   "metadata": {},
   "source": [
    "下面我们对这个server进行调用，下面列举了三个例子：\n",
    "1. 查看模型列表\n",
    "2. curl方式的调用\n",
    "3. openai包方式调用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "da33bd18-04f3-45ef-a397-38303fb12d10",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-12-23T13:35:25.528124Z",
     "iopub.status.busy": "2024-12-23T13:35:25.527463Z",
     "iopub.status.idle": "2024-12-23T13:35:25.643576Z",
     "shell.execute_reply": "2024-12-23T13:35:25.642988Z",
     "shell.execute_reply.started": "2024-12-23T13:35:25.528090Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"object\":\"list\",\"data\":[{\"id\":\"Qwen/Qwen2.5-1.5B-Instruct\",\"object\":\"model\",\"created\":1734960925,\"owned_by\":\"vllm\",\"root\":\"Qwen/Qwen2.5-1.5B-Instruct\",\"parent\":null,\"max_model_len\":32768,\"permission\":[{\"id\":\"modelperm-196cd2277ec042d0b5879a06c38a21c5\",\"object\":\"model_permission\",\"created\":1734960925,\"allow_create_engine\":false,\"allow_sampling\":true,\"allow_logprobs\":true,\"allow_search_indices\":false,\"allow_view\":true,\"allow_fine_tuning\":false,\"organization\":\"*\",\"group\":null,\"is_blocking\":false}]}]}"
     ]
    }
   ],
   "source": [
    "# 查看vLLM模型\n",
    "!curl http://localhost:8000/v1/models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b5869b9e-0df3-45ea-9ccb-16459c8c583d",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-12-23T13:36:48.051772Z",
     "iopub.status.busy": "2024-12-23T13:36:48.050936Z",
     "iopub.status.idle": "2024-12-23T13:36:48.322141Z",
     "shell.execute_reply": "2024-12-23T13:36:48.321384Z",
     "shell.execute_reply.started": "2024-12-23T13:36:48.051746Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"id\":\"chat-b4888f4f2c9c4ca0aab48a4bd53859ec\",\"object\":\"chat.completion\",\"created\":1734961008,\"model\":\"Qwen/Qwen2.5-1.5B-Instruct\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\"The New York Yankees won the World Series in 2020.\",\"tool_calls\":[]},\"logprobs\":null,\"finish_reason\":\"stop\",\"stop_reason\":null}],\"usage\":{\"prompt_tokens\":31,\"total_tokens\":47,\"completion_tokens\":16},\"prompt_logprobs\":null}"
     ]
    }
   ],
   "source": [
    "# curl类型 client代码\n",
    "!curl http://localhost:8000/v1/chat/completions \\\n",
    "    -H \"Content-Type: application/json\" \\\n",
    "    -d '{ \\\n",
    "    \"model\": \"Qwen/Qwen2.5-1.5B-Instruct\", \\\n",
    "    \"messages\": [ \\\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"}, \\\n",
    "    {\"role\": \"user\", \"content\": \"Who won the world series in 2020?\"} \\\n",
    "    ] \\\n",
    "    }'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "8ab93d07-d378-4aa7-9b58-943319191cd8",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-12-23T13:38:33.830246Z",
     "iopub.status.busy": "2024-12-23T13:38:33.829704Z",
     "iopub.status.idle": "2024-12-23T13:38:38.046731Z",
     "shell.execute_reply": "2024-12-23T13:38:38.046019Z",
     "shell.execute_reply.started": "2024-12-23T13:38:33.830219Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Looking in indexes: https://mirrors.aliyun.com/pypi/simple\n",
      "Requirement already satisfied: openai in /usr/local/lib/python3.10/site-packages (1.57.0)\n",
      "Collecting openai\n",
      "  Downloading https://mirrors.aliyun.com/pypi/packages/8e/5a/d22cd07f1a99b9e8b3c92ee0c1959188db4318828a3d88c9daac120bdd69/openai-1.58.1-py3-none-any.whl (454 kB)\n",
      "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m454.3/454.3 kB\u001b[0m \u001b[31m9.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
      "\u001b[?25hRequirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/site-packages (from openai) (4.7.0)\n",
      "Requirement already satisfied: distro<2,>=1.7.0 in /usr/local/lib/python3.10/site-packages (from openai) (1.9.0)\n",
      "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.10/site-packages (from openai) (0.27.2)\n",
      "Requirement already satisfied: jiter<1,>=0.4.0 in /usr/local/lib/python3.10/site-packages (from openai) (0.8.0)\n",
      "Requirement already satisfied: pydantic<3,>=1.9.0 in /usr/local/lib/python3.10/site-packages (from openai) (2.10.3)\n",
      "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/site-packages (from openai) (1.3.1)\n",
      "Requirement already satisfied: tqdm>4 in /usr/local/lib/python3.10/site-packages (from openai) (4.67.1)\n",
      "Requirement already satisfied: typing-extensions<5,>=4.11 in /usr/local/lib/python3.10/site-packages (from openai) (4.12.2)\n",
      "Requirement already satisfied: exceptiongroup>=1.0.2 in /usr/local/lib/python3.10/site-packages (from anyio<5,>=3.5.0->openai) (1.2.2)\n",
      "Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.10/site-packages (from anyio<5,>=3.5.0->openai) (3.10)\n",
      "Requirement already satisfied: certifi in /usr/local/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (2024.8.30)\n",
      "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/site-packages (from httpx<1,>=0.23.0->openai) (1.0.7)\n",
      "Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.10/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.14.0)\n",
      "Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (0.7.0)\n",
      "Requirement already satisfied: pydantic-core==2.27.1 in /usr/local/lib/python3.10/site-packages (from pydantic<3,>=1.9.0->openai) (2.27.1)\n",
      "Installing collected packages: openai\n",
      "  Attempting uninstall: openai\n",
      "    Found existing installation: openai 1.57.0\n",
      "    Uninstalling openai-1.57.0:\n",
      "      Successfully uninstalled openai-1.57.0\n",
      "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
      "lmdeploy 0.6.2 requires peft<=0.11.1, but you have peft 0.12.0 which is incompatible.\u001b[0m\u001b[31m\n",
      "\u001b[0mSuccessfully installed openai-1.58.1\n",
      "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\u001b[33m\n",
      "\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.3.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.3.1\u001b[0m\n",
      "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
      "Chat response: ChatCompletion(id='chat-8a3920662df64279ba847ff1bb0b8907', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=\"Why don't scientists trust atoms?\\n\\nBecause they make up everything.\", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)], created=1734961117, model='Qwen/Qwen2.5-1.5B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=14, prompt_tokens=24, total_tokens=38, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)\n"
     ]
    }
   ],
   "source": [
    "# python调用\n",
    "!pip install openai -U\n",
    "from openai import OpenAI\n",
    "# Set OpenAI's API key and API base to use vLLM's API server.\n",
    "openai_api_key = \"EMPTY\"\n",
    "openai_api_base = \"http://localhost:8000/v1\"\n",
    "\n",
    "client = OpenAI(\n",
    "    api_key=openai_api_key,\n",
    "    base_url=openai_api_base,\n",
    ")\n",
    "\n",
    "chat_response = client.chat.completions.create(\n",
    "    model=\"Qwen/Qwen2.5-1.5B-Instruct\",\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
    "        {\"role\": \"user\", \"content\": \"Tell me a joke.\"},\n",
    "    ]\n",
    ")\n",
    "print(\"Chat response:\", chat_response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
